AGU 2021 Open Science in Action Tutorials:

Openscapes NASA-Earthdata

NASA Earthdata Access in the Cloud Using Open-source Libraries

Acknowledgements:

Each of the following co-authors contributed to the following materials and code examples, as well as collaboration infrastructure for the NASA Earthdata Openscapes Project:

  • Julia S. Stewart Lowndes; Openscapes, NCEAS
  • Erin Robinson; Openscapes, Metadata Game Changers
  • Catalina M Oaida; NASA PO.DAAC, NASA Jet Propulsion Laboratory
  • Luis Alberto Lopez; NASA National Snow and Ice Data Center DAAC
  • Aaron Friesz; NASA Land Processes DAAC
  • Andrew P Barrett; NASA National Snow and Ice Data Center DAAC
  • Makhan Virdi; NASA ASDC DAAC
  • Jack McNelis; NASA PO.DAAC, NASA Jet Propulsion Laboratory

Additional credit to the entire NASA Earthdata Openscapes Project community, Patrick Quinn at Element84, and to 2i2c for our cloud computing infrastructure.

Outline:

  1. Introduction to NASA Earthdata’s move to the cloud
    • Background and motivation
    • Enabling Open Science via “Analysis-in-Place”
    • Resources for cloud adopters: NASA Earthdata Openscapes
  2. NASA Earthdata discovery and access in the cloud
    • Part 1: Explore Earthdata cloud data availability
    • Part 2: Working with Cloud-Optimized GeoTIFFs using NASA’s Common Metadata Repository SpatioTemporal Asset Catalog (CMR-STAC)
    • Part 3: Working with Zarr-formatted data using NASA’s Harmony cloud transformation service
Tutorial materials are adapted from repositories on the NASA Openscapes public GitHub:
  • This notebook source code: https://github.com/NASA-Openscapes/2021-Cloud-Workshop-AGU/tree/main/how-tos
  • Also available as an online Quarto book: https://nasa-openscapes.github.io/2021-Cloud-Workshop-AGU/

The NASA Earthdata archive continues to grow

The NASA Earthdata Cloud Evolution

Figures: Cloud Evolution; EOSDIS Growth

  • NASA Distributed Active Archive Centers (DAACs) are continuing to migrate data to the Earthdata Cloud
    • Supporting increased data volume as new, high-resolution remote sensing missions launch in the coming years
    • Data hosted via Amazon Web Services, or AWS
    • DAACs continuing to support tools, services, and tutorial resources for our user communities

NASA Earthdata Cloud as an enabler of Open Science

  • Reducing barriers to large-scale scientific research in the era of “big data”
  • Increasing community contributions with hands-on engagement
  • Promoting reproducible and shareable workflows without relying on local storage systems

Data and Analysis co-located “in place”

Building NASA Earthdata Cloud Resources

Slide: three panels of user resources

NASA Earthdata Cloud: Discovery and access using open source technologies

The following tutorial demonstrates several basic end-to-end workflows to interact with data “in place” from the NASA Earthdata Cloud, accessing Amazon Web Services (AWS) Simple Storage Service (S3) data locations without the need to download data. While the data can still be downloaded locally, the cloud offers the ability to scale compute resources to perform analyses over large areas and time spans, which is critical as data volumes continue to grow.

Although the examples in this notebook focus on a small time range and area for demonstration purposes, this workflow can be modified and scaled up to suit a larger time range and region of interest.

Datasets of interest:

  • Harmonized Landsat Sentinel-2 (HLS) Operational Land Imager Surface Reflectance and TOA Brightness Daily Global 30m v2.0 (L30) (10.5067/HLS/HLSL30.002)

  • Monthly sea surface height from ECCO V4r4 (10.5067/ECG5D-SSH44). The data are provided as a time series of monthly netCDFs on a 0.5-degree latitude/longitude grid. We will access the data from within the AWS cloud (specifically the us-west-2 region) and load a time series made of multiple netCDF datasets into a single xarray dataset, an approach that leverages S3-native protocols for efficient access. In this example we are interested in the ECCO data collection from NASA’s PO.DAAC in the Earthdata Cloud; the string that uniquely identifies this collection of monthly, 0.5-degree sea surface height data is ECCO_L4_SSH_05DEG_MONTHLY_V4R4.
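As an aside, a collection short name like the one above can also be used to query NASA’s Common Metadata Repository (CMR) directly. The snippet below is a minimal, hedged sketch against the public CMR search API; the endpoint and parameters shown follow that API’s documented usage and may need adjusting:

import requests

# Minimal CMR granule search for the ECCO monthly SSH collection over 2015 (illustrative sketch).
cmr_url = 'https://cmr.earthdata.nasa.gov/search/granules.json'
params = {
    'short_name': 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4',
    'temporal': '2015-01-01T00:00:00Z,2015-12-31T23:59:59Z',
    'page_size': 12,
}
granules = requests.get(cmr_url, params=params).json()['feed']['entry']
print(len(granules), 'granules found')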

Part 1: Explore Earthdata cloud data availability

Earthdata Search exploration to get s3 URLs

Using Earthdata Search to find data hosted in the NASA Earthdata Cloud

From Earthdata Search https://search.earthdata.nasa.gov, use your Earthdata login credentials to log in. You can create an Earthdata Login account at https://urs.earthdata.nasa.gov.

In this example we are interested in the ECCO dataset, hosted by PO.DAAC. This dataset is available from the NASA Earthdata Cloud archive hosted in the AWS cloud.

Click on the “Available from AWS Cloud” filter option on the left. In this example, searching for ECCO monthly SSH with the time period set to the year 2015 returned 39 matching collections; the time period can be set using the calendar icon on the left, under the search box. Scroll down the list of returned matches until you see the dataset of interest, in this case ECCO Sea Surface Height - Monthly Mean 0.5 Degree (Version 4 Release 4).

View and Select Data Access Options

Clicking on the ECCO Sea Surface Height - Monthly Mean 0.5 Degree (Version 4 Release 4) dataset, we now see a list of files (granules) that are part of the dataset (collection). We can click on the green + symbol to add a few files to our project. Here we added the first 3 listed for 2015. Then click on the green button towards the bottom that says “Download”. This will take us to another page with options to customize our download or access link(s).

Access Options

Let’s say we are interested in the entire file content, so we select the “Direct Download” option (as opposed to other options to subset or transform the data):

Clicking the green Download Data button again takes us to the final page, with instructions for downloading and links for data access in the cloud. You should see three tabs: Download Files, AWS S3 Access, and Download Script:

Figure captions: Download to local; Direct S3 access; Download script

The Download Files tab provides the https:// links for downloading the files locally. E.g.: https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-09_ECCO_V4r4_latlon_0p50deg.nc

The AWS S3 Access tab provides the s3:// links, which are what we would use to access the data directly in-region (us-west-2) within the AWS cloud (examples are shown in Parts 2 and 3). E.g.: s3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-09_ECCO_V4r4_latlon_0p50deg.nc, where s3 indicates the data are stored in AWS S3 storage, podaac-ops-cumulus-protected is the bucket, and ECCO_L4_SSH_05DEG_MONTHLY_V4R4 is the object prefix (the latter two are also listed in the dataset collection information under Cloud Access).

Tip: Another, quicker way to find the bucket and object prefix is from the list of data files the search returns. Next to the green + button is a grey download symbol. Click on that to see the Download Files https:// links, or on AWS S3 Access to get the direct s3:// access links, which contain the bucket and object prefix where the data are stored.
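For illustration, the bucket and object prefix can also be pulled out of an s3:// link with plain string handling (a minimal sketch; no AWS calls involved):

# Split an s3:// link into its bucket and object prefix (illustrative only).
s3_link = 's3://podaac-ops-cumulus-protected/ECCO_L4_SSH_05DEG_MONTHLY_V4R4/SEA_SURFACE_HEIGHT_mon_mean_2015-09_ECCO_V4r4_latlon_0p50deg.nc'
bucket, key = s3_link.replace('s3://', '', 1).split('/', 1)
prefix = key.rsplit('/', 1)[0]
print('bucket:', bucket)          # podaac-ops-cumulus-protected
print('object prefix:', prefix)   # ECCO_L4_SSH_05DEG_MONTHLY_V4R4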

Subset or transform before download or access

DAAC tools and services are also being migrated to, or developed in, the cloud, next to the data. Examples include the Harmony API and OPeNDAP in the cloud.

We can leverage these cloud-based services on cloud-archived data to reduce or transform the data (depending on need) before getting the access links, regardless of whether we prefer to download the data and work on a local machine or to access the data in the cloud (from a cloud workspace). These data reduction services can support a faster time to science.

Harmony

Harmony allows you to seamlessly analyze Earth observation data from different NASA data centers. These services (API endpoints) provide data reduction (e.g., subsetting) and transformation services (e.g., converting netCDF data to the Zarr cloud-optimized format).

When you click the final green Download button, the links provided are to data that have been transformed based on our selections on the previous screen (here, choosing to use the Harmony service to reformat the data to Zarr). These data are staged for us in an S3 bucket in AWS, and we can use the s3:// links to access those specific data. This service also provides STAC access links. This particular example is applicable if your workflow is in the AWS us-west-2 region.

Part 2: Working with Cloud-Optimized GeoTIFFs using NASA’s Common Metadata Repository SpatioTemporal Asset Catalog (CMR-STAC)

In this example we will access NASA’s Harmonized Landsat Sentinel-2 (HLS) version 2 assets, which are archived in cloud-optimized GeoTIFF (COG) format by the Land Processes (LP) DAAC. COGs can be used like any other GeoTIFF file, but have added features that make them more efficient within the cloud data access paradigm, including overviews and internal tiling.
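As a quick illustration of those features, rasterio can report a COG’s internal tile size and overview levels. This is a hedged sketch: 'example_hls_band.tif' is a placeholder for any HLS COG you have access to, not a file used elsewhere in this tutorial.

import rasterio as rio

# Inspect COG internals (placeholder filename; substitute a real HLS band file).
with rio.open('example_hls_band.tif') as src:
    print('internal tile shape:', src.block_shapes[0])   # e.g., (256, 256) internal tiles
    print('overview levels:', src.overviews(1))          # decimation factors of the embedded overviews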

But first, what is STAC?

SpatioTemporal Asset Catalog (STAC) is a specification that provides a common language for interpreting geospatial information in order to standardize indexing and discovering data.

The STAC specification is made up of a collection of related, yet independent specifications that when used together provide search and discovery capabilities for remote assets.

Four STAC Specifications

  • STAC Catalog (aka DAAC Archive)
  • STAC Collection (aka Data Product)
  • STAC Item (aka Granule)
  • STAC API
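To make that mapping concrete, the sketch below walks part of the hierarchy over plain HTTP. It assumes the CMR-STAC endpoints follow the standard STAC API layout (a /collections route under each provider catalog); the exact response fields may differ.

import requests

stac_root = 'https://cmr.earthdata.nasa.gov/stac'                       # root of the CMR-STAC provider catalogs
lpcloud_collections = requests.get(f'{stac_root}/LPCLOUD/collections').json()
# Each entry is a STAC Collection, roughly equivalent to a NASA data product.
print([c['id'] for c in lpcloud_collections.get('collections', [])][:5])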

CMR-STAC API

The CMR-STAC API is NASA’s implementation of the STAC API specification for all NASA data holdings within EOSDIS. The current implementation does not allow for queries across the entire NASA catalog. Users must execute searches within provider catalogs (e.g., LPCLOUD) to find the STAC Items they are looking for. All the providers can be found at the CMR-STAC endpoint: https://cmr.earthdata.nasa.gov/stac/.

In this example, we will query the LPCLOUD provider to identify STAC Items from the Harmonized Landsat Sentinel-2 (HLS) collection that fall within our region of interest (ROI) and within our specified time range.

Import packages

import os
import requests 
import boto3
from osgeo import gdal
import rasterio as rio
from rasterio.session import AWSSession
import rioxarray
import hvplot.xarray
import holoviews as hv

from pystac_client import Client  
from collections import defaultdict    
import json
import geopandas
import geoviews as gv
from cartopy import crs
gv.extension('bokeh', 'matplotlib')

from harmony import BBox, Client as HarmonyClient, Collection, Request, LinkType  # aliased to avoid clashing with pystac_client.Client
from harmony.config import Environment
from pprint import pprint
from pqdm.threads import pqdm  # used later to open multiple Zarr stores in parallel
import datetime as dt
import s3fs
import xarray as xr

Connect to the CMR-STAC API

STAC_URL = 'https://cmr.earthdata.nasa.gov/stac'
provider_cat = Client.open(STAC_URL)

Connect to the LPCLOUD Provider/STAC Catalog

For this next step we need the provider title (e.g., LPCLOUD). We will add the provider to the end of the CMR-STAC API URL (i.e., https://cmr.earthdata.nasa.gov/stac/) to connect to the LPCLOUD STAC Catalog.

catalog = Client.open(f'{STAC_URL}/LPCLOUD/')

Since we are using a dedicated client (i.e., pystac-client.Client) to connect to our STAC Provider Catalog, we will have access to some useful internal methods and functions (e.g., get_children() or get_all_items()) we can use to get information from these objects.
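For example, here is a short sketch listing the provider catalogs that sit beneath the CMR-STAC root via get_children(). The exact set of providers changes over time, and resolving each child catalog may take a moment, so the output is only illustrative.

# List the first few provider catalogs (children of the CMR-STAC root catalog).
providers = [p.id for p in provider_cat.get_children()]
print(providers[:10])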

Search for STAC Items

We will define our ROI using a geojson file containing a small polygon feature in western Nebraska, USA. We’ll also specify the data collections and a time range for our example.

Read in a geojson file and plot

Read in a geojson file with geopandas and extract the coordinates for our ROI.

field = geopandas.read_file('../data/ne_w_agfields.geojson')
fieldShape = field['geometry'][0]
roi = json.loads(field.to_json())['features'][0]['geometry']

We can plot the polygon using the geoviews package that we imported as gv with the ‘bokeh’ and ‘matplotlib’ extensions. The options below give a reasonable width, height, color, and line width for viewing our polygon when it is overlaid on a base tile map.

base = gv.tile_sources.EsriImagery.opts(width=650, height=500)
farmField = gv.Polygons(fieldShape).opts(line_color='yellow', line_width=10, color=None)
base * farmField
(Interactive map output not rendered: the ROI polygon overlaid on an Esri imagery basemap.)

We will now specify the search criteria we are interested in, i.e., the date range, the ROI, and the data collections, which we will pass to the STAC API.

Search the CMR-STAC API with our search criteria

Now we can put all our search criteria together using catalog.search from the pystac_client package. A STAC Collection is synonymous with what we usually consider a NASA data product. Desired STAC Collections are submitted to the search API as a list of collection ids. Let’s focus on the S30 and L30 collections.

collections = ['HLSL30.v2.0', 'HLSS30.v2.0']

date_range = "2021-05/2021-08"

search = catalog.search(
    collections=collections,
    intersects=roi,
    datetime=date_range,
    limit=100
)

View STAC Items that matched our search query

print('Matching STAC Items:', search.matched())
item_collection = search.get_all_items()
item_collection[0].to_dict()
Matching STAC Items: 113
{'type': 'Feature',
 'stac_version': '1.0.0',
 'id': 'HLS.L30.T13TGF.2021124T173013.v2.0',
 'properties': {'datetime': '2021-05-04T17:30:13.428000Z',
  'start_datetime': '2021-05-04T17:30:13.428Z',
  'end_datetime': '2021-05-04T17:30:37.319Z',
  'eo:cloud_cover': 36},
 'geometry': {'type': 'Polygon',
  'coordinates': [[[-101.5423534, 40.5109845],
    [-101.3056118, 41.2066375],
    [-101.2894253, 41.4919436],
    [-102.6032964, 41.5268623],
    [-102.638891, 40.5386175],
    [-101.5423534, 40.5109845]]]},
 'links': [{'rel': 'self',
   'href': 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD/collections/HLSL30.v2.0/items/HLS.L30.T13TGF.2021124T173013.v2.0'},
  {'rel': 'parent',
   'href': 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD/collections/HLSL30.v2.0'},
  {'rel': 'collection',
   'href': 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD/collections/HLSL30.v2.0'},
  {'rel': <RelType.ROOT: 'root'>,
   'href': 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD/',
   'type': <MediaType.JSON: 'application/json'>,
   'title': 'LPCLOUD'},
  {'rel': 'provider', 'href': 'https://cmr.earthdata.nasa.gov/stac/LPCLOUD'},
  {'rel': 'via',
   'href': 'https://cmr.earthdata.nasa.gov/search/concepts/G2144020713-LPCLOUD.json'},
  {'rel': 'via',
   'href': 'https://cmr.earthdata.nasa.gov/search/concepts/G2144020713-LPCLOUD.umm_json'}],
 'assets': {'B11': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B11.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B11.tif'},
  'B07': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B07.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B07.tif'},
  'SAA': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.SAA.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.SAA.tif'},
  'B06': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B06.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B06.tif'},
  'B09': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B09.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B09.tif'},
  'B10': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B10.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B10.tif'},
  'VZA': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.VZA.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.VZA.tif'},
  'SZA': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.SZA.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.SZA.tif'},
  'B01': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B01.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B01.tif'},
  'VAA': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.VAA.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.VAA.tif'},
  'B05': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B05.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B05.tif'},
  'B02': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B02.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B02.tif'},
  'Fmask': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.Fmask.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.Fmask.tif'},
  'B03': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B03.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B03.tif'},
  'B04': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.B04.tif',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.B04.tif'},
  'browse': {'href': 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSL30.020/HLS.L30.T13TGF.2021124T173013.v2.0/HLS.L30.T13TGF.2021124T173013.v2.0.jpg',
   'type': 'image/jpeg',
   'title': 'Download HLS.L30.T13TGF.2021124T173013.v2.0.jpg'},
  'metadata': {'href': 'https://cmr.earthdata.nasa.gov/search/concepts/G2144020713-LPCLOUD.xml',
   'type': 'application/xml'}},
 'bbox': [-102.638891, 40.510984, -101.289425, 41.526862],
 'stac_extensions': ['https://stac-extensions.github.io/eo/v1.0.0/schema.json'],
 'collection': 'HLSL30.v2.0'}

Filtering STAC Items

Below we loop through the item_collection, filter by a specified cloud cover threshold, and extract the bands we need for an Enhanced Vegetation Index (EVI) calculation in a future analysis. We also specify the STAC Assets (i.e., bands/layers) of interest for both the S30 and L30 collections (listed in our collections variable above) and print out the first ten links, converted to s3 locations:

cloudcover = 25

s30_bands = ['B8A', 'B04', 'B02', 'Fmask']    # S30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality 
l30_bands = ['B05', 'B04', 'B02', 'Fmask']    # L30 bands for EVI calculation and quality filtering -> NIR, RED, BLUE, Quality 

evi_band_links = []

for i in item_collection:
    if i.properties['eo:cloud_cover'] <= cloudcover:
        if i.collection_id == 'HLSS30.v2.0':
            #print(i.properties['eo:cloud_cover'])
            evi_bands = s30_bands
        elif i.collection_id == 'HLSL30.v2.0':
            #print(i.properties['eo:cloud_cover'])
            evi_bands = l30_bands

        for a in i.assets:
            if any(b==a for b in evi_bands):
                evi_band_links.append(i.assets[a].href)
                
s3_links = [l.replace('https://data.lpdaac.earthdatacloud.nasa.gov/', 's3://') for l in evi_band_links]
s3_links[:10]
['s3://lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021133T172406.v2.0/HLS.L30.T13TGF.2021133T172406.v2.0.B04.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021133T172406.v2.0/HLS.L30.T13TGF.2021133T172406.v2.0.B05.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021133T172406.v2.0/HLS.L30.T13TGF.2021133T172406.v2.0.Fmask.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021133T172406.v2.0/HLS.L30.T13TGF.2021133T172406.v2.0.B02.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T14TKL.2021133T172406.v2.0/HLS.L30.T14TKL.2021133T172406.v2.0.B02.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T14TKL.2021133T172406.v2.0/HLS.L30.T14TKL.2021133T172406.v2.0.B04.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T14TKL.2021133T172406.v2.0/HLS.L30.T14TKL.2021133T172406.v2.0.B05.tif',
 's3://lp-prod-protected/HLSL30.020/HLS.L30.T14TKL.2021133T172406.v2.0/HLS.L30.T14TKL.2021133T172406.v2.0.Fmask.tif',
 's3://lp-prod-protected/HLSS30.020/HLS.S30.T14TKL.2021133T173859.v2.0/HLS.S30.T14TKL.2021133T173859.v2.0.B04.tif',
 's3://lp-prod-protected/HLSS30.020/HLS.S30.T14TKL.2021133T173859.v2.0/HLS.S30.T14TKL.2021133T173859.v2.0.B8A.tif']

Access s3 storage location

Request temporary s3 credentials from the LP DAAC credentials endpoint and create a boto3 Session object using them. This Session passes credentials and configuration to AWS so we can interact with S3 objects in the applicable buckets.

s3_cred_endpoint = 'https://data.lpdaac.earthdatacloud.nasa.gov/s3credentials'
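# Note: this request relies on your Earthdata Login credentials being available to
# requests (e.g., via a ~/.netrc file); without them the endpoint will not return credentials.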
temp_creds_req = requests.get(s3_cred_endpoint).json()

session = boto3.Session(aws_access_key_id=temp_creds_req['accessKeyId'], 
                        aws_secret_access_key=temp_creds_req['secretAccessKey'],
                        aws_session_token=temp_creds_req['sessionToken'],
                        region_name='us-west-2')

GDAL Configurations

GDAL is a foundational piece of geospatial software that is leveraged by several popular open-source (and closed-source) geospatial packages; rasterio is no exception. Rasterio leverages GDAL to, among other things, read and write raster data files, e.g., GeoTIFFs and Cloud Optimized GeoTIFFs. To read remote files, i.e., files/objects stored in the cloud, GDAL uses its Virtual File System API. In a perfect world, one could point a Virtual File System (there are several) at a remote data asset and have the asset retrieved, but that is not always the case. GDAL has a host of configuration/environment variables that adjust its behavior, for example to make requests more performant or to pass AWS credentials to the distribution system. Below, we identify the environment variables that will help us get our data from the cloud.

rio_env = rio.Env(AWSSession(session),                                      # pass the boto3 session (temporary S3 credentials) to GDAL
                  GDAL_DISABLE_READDIR_ON_OPEN='TRUE',                      # skip listing the directory when opening a remote file
                  GDAL_HTTP_COOKIEFILE=os.path.expanduser('~/cookies.txt'), # persist Earthdata Login cookies between requests
                  GDAL_HTTP_COOKIEJAR=os.path.expanduser('~/cookies.txt'))
rio_env.__enter__()
<rasterio.env.Env at 0x7fa9580f1e80>
s3_url = 's3://lp-prod-protected/HLSL30.020/HLS.L30.T11SQA.2021333T181532.v2.0/HLS.L30.T11SQA.2021333T181532.v2.0.B04.tif'
# s3_url = 's3://lp-prod-protected/HLSL30.020/HLS.L30.T13TGF.2021133T172406.v2.0/HLS.L30.T13TGF.2021133T172406.v2.0.B04.tif'

Read Cloud-Optimized GeoTIFF into rioxarray

da = rioxarray.open_rasterio(s3_url)
da
<xarray.DataArray (band: 1, y: 3660, x: 3660)>
[13395600 values with dtype=int16]
Coordinates:
  * band         (band) int64 1
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red

When GeoTIFFs/Cloud Optimized GeoTIFFs are read in, a band coordinate variable is automatically created (see the printout above). In this exercise we will not use that coordinate variable, so we remove it using the squeeze() function to avoid confusion.

da_red = da.squeeze('band', drop=True)
da_red
<xarray.DataArray (y: 3660, x: 3660)>
[13395600 values with dtype=int16]
Coordinates:
  * x            (x) float64 7e+05 7e+05 7e+05 ... 8.097e+05 8.097e+05 8.097e+05
  * y            (y) float64 4.1e+06 4.1e+06 4.1e+06 ... 3.99e+06 3.99e+06
    spatial_ref  int64 0
Attributes:
    _FillValue:    -9999.0
    scale_factor:  0.0001
    add_offset:    0.0
    long_name:     Red
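The attributes above include a _FillValue and a scale_factor. As a hedged sketch (not part of the original workflow), the stored integer values could be converted to reflectance before analysis by masking the fill value and applying the scale factor and offset:

# Illustrative only: mask fill values, then scale the red band to reflectance.
red_refl = da_red.where(da_red != da_red.attrs['_FillValue'])
red_refl = red_refl * da_red.attrs['scale_factor'] + da_red.attrs['add_offset']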

Plot using hvplot

da_red.hvplot.image(x='x', y='y', cmap='gray', aspect='equal')
(Interactive plot output not rendered: grayscale image of the red band.)
rio_env.__exit__()

Part 3: Working with Zarr-formatted data using NASA’s Harmony cloud transformation service

We have already explored direct access to the NASA EOSDIS archive in the cloud via the Amazon Simple Storage Service (S3). In addition to directly accessing the files archived and distributed by each of the NASA DAACs, many datasets also support services that allow us to customize the data via subsetting, reformatting, reprojection, and other transformations.

This example demonstrates “analysis in place” using customized ECCO Level 4 monthly sea surface height data, in this case reformatted to Zarr by NASA Harmony, part of a new ecosystem of services operating within the NASA Earthdata Cloud:

  • Consistent access patterns to EOSDIS holdings make cross-data center data access easier
  • Data reduction services allow us to request only the data we want, in the format and projection we want
  • Analysis Ready Data and cloud access will help reduce time-to-science
  • Community Development helps reduce the barriers for re-use of code and sharing of domain knowledge

Using Harmony-Py to customize data

Harmony-Py provides a pip-installable Python alternative to directly using Harmony’s RESTful API. It makes it easier to request data and service options, especially when interacting within a Python Jupyter Notebook environment.

Create Harmony Client object

First, we need to create a Harmony Client, which is what we will interact with to submit and inspect a data request to Harmony, as well as to retrieve results.

harmony_client = HarmonyClient()

Create Harmony Request

Specify a temporal range over 2015, and Zarr as an output format.

Zarr is an open source library for storing N-dimensional array data. It supports multidimensional arrays with attributes and dimensions similar to NetCDF4, and it can be read by XArray. Zarr is often used for data held in cloud object storage (like Amazon S3), because it is better optimized for these situations than NetCDF4.
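As a minimal local illustration of the Zarr data model (separate from the Harmony request below), a small xarray Dataset can be written to a Zarr store and then lazily reopened; the store name here is arbitrary:

import numpy as np

# Write a tiny dataset to a local Zarr store, then reopen it lazily (illustrative only).
tiny = xr.Dataset({'ssh_demo': (('time', 'latitude'), np.random.rand(3, 4))})
tiny.to_zarr('tiny_example.zarr', mode='w')
reopened = xr.open_zarr('tiny_example.zarr')   # reads metadata only; values load on demand
print(reopened)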

short_name = 'ECCO_L4_SSH_05DEG_MONTHLY_V4R4'

request = Request(
    collection=Collection(id=short_name),
    temporal={
        'start': dt.datetime(2015, 1, 2),
        'stop': dt.datetime(2015, 12, 31),
    },
    format='application/x-zarr'
)

job_id = harmony_client.submit(request)

Check request status and view output URLs

Harmony data outputs can be accessed within the cloud using the s3 URLs and AWS credentials provided in the Harmony job response:

harmony_client.wait_for_processing(job_id, show_progress=True)

results = harmony_client.result_urls(job_id, link_type=LinkType.s3)
s3_urls = list(results)
s3_urls
 [ Processing: 100% ] |###################################################| [|]
['s3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-02_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-01_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-11_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-12_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-03_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-04_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-05_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-06_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-07_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-08_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-09_ECCO_V4r4_latlon_0p50deg.zarr',
 's3://harmony-prod-staging/public/harmony/netcdf-to-zarr/dae0e7b9-26d1-4db0-a26e-67e1ea8ce2f0/SEA_SURFACE_HEIGHT_mon_mean_2015-10_ECCO_V4r4_latlon_0p50deg.zarr']

AWS credential retrieval

Using aws_credentials you can retrieve the credentials needed to access the Harmony s3 staging bucket and its contents.

creds = harmony_client.aws_credentials()

Open staged files with s3fs and xarray

Use the s3fs package with the AWS credentials retrieved above to create a file system object that can then be read by xarray:

s3_fs = s3fs.S3FileSystem(
    key=creds['aws_access_key_id'],
    secret=creds['aws_secret_access_key'],
    token=creds['aws_session_token'],
    client_kwargs={'region_name':'us-west-2'},
)

Open the Zarr stores using the s3fs package, then load them all at once into a concatenated xarray dataset:

stores = [s3fs.S3Map(root=url, s3=s3_fs, check=False) for url in s3_urls]
stores
[<fsspec.mapping.FSMap at 0x7fa953363820>,
 <fsspec.mapping.FSMap at 0x7fa9533630d0>,
 <fsspec.mapping.FSMap at 0x7fa9533635b0>,
 <fsspec.mapping.FSMap at 0x7fa953363670>,
 <fsspec.mapping.FSMap at 0x7fa9533633d0>,
 <fsspec.mapping.FSMap at 0x7fa953363610>,
 <fsspec.mapping.FSMap at 0x7fa9533637f0>,
 <fsspec.mapping.FSMap at 0x7fa9533636a0>,
 <fsspec.mapping.FSMap at 0x7fa953363790>,
 <fsspec.mapping.FSMap at 0x7fa953363190>,
 <fsspec.mapping.FSMap at 0x7fa953363400>,
 <fsspec.mapping.FSMap at 0x7fa953363460>]

def open_zarr_xarray(store):
    return xr.open_zarr(store=store, consolidated=True)

datasets = pqdm(stores, open_zarr_xarray, n_jobs=12)
# datasets
ds = xr.concat(datasets, dim='time', coords='minimal')
ds = xr.decode_cf(ds, mask_and_scale=True, decode_coords=True)
ds
<xarray.Dataset>
Dimensions:         (longitude: 720, time: 12, latitude: 360, nv: 2)
Coordinates:
  * longitude       (longitude) float64 -179.8 -179.2 -178.8 ... 179.2 179.8
  * latitude        (latitude) float32 -89.75 -89.25 -88.75 ... 89.25 89.75
    latitude_bnds   (latitude, nv) float32 ...
    longitude_bnds  (longitude, nv) float32 ...
  * time            (time) datetime64[ns] 2015-02-15 ... 2015-10-16T12:00:00
    time_bnds       (time, nv) datetime64[ns] dask.array<chunksize=(1, 2), meta=np.ndarray>
Dimensions without coordinates: nv
Data variables:
    SSH             (time, latitude, longitude) float32 dask.array<chunksize=(1, 360, 720), meta=np.ndarray>
    SSHIBC          (time, latitude, longitude) float32 dask.array<chunksize=(1, 360, 720), meta=np.ndarray>
    SSHNOIBC        (time, latitude, longitude) float32 dask.array<chunksize=(1, 360, 720), meta=np.ndarray>
Attributes: (12/57)
    Conventions:                  CF-1.8, ACDD-1.3
    acknowledgement:              This research was carried out by the Jet Pr...
    author:                       Ian Fenty and Ou Wang
    cdm_data_type:                Grid
    comment:                      Fields provided on a regular lat-lon grid. ...
    coordinates_comment:          Note: the global 'coordinates' attribute de...
    ...                           ...
    time_coverage_duration:       P1M
    time_coverage_end:            2015-03-01T00:00:00
    time_coverage_resolution:     P1M
    time_coverage_start:          2015-02-01T00:00:00
    title:                        ECCO Sea Surface Height - Monthly Mean 0.5 ...
    uuid:                         0894e538-4158-11eb-a982-0cc47a3f47f1

ssh_da = ds.SSH
ssh_da

xarray reads data lazily, i.e., only when our code actually needs it. Up to this point, we haven’t read any data values, only metadata. The next line forces xarray to read the SSH values from the staged Zarr stores; behind the scenes, s3fs and the Zarr library fetch only the chunks that are needed.

Note: This line isn’t strictly necessary, since xarray will automatically read the data we need the first time our code tries to use it, but calling it here ensures we can read the data multiple times later on without re-fetching anything from the source files.

This line can take several seconds to complete, but because it retrieves only the variable of interest rather than the full set of source files, those seconds still represent a significant time, bandwidth, and disk-space savings.

ssh_da.load();

Now we can start looking at aggregations across the time dimension. In this case, plot the standard deviation of sea surface height at each point to get a visual sense of how much SSH fluctuates over the course of the year.

# We expect a warning here, from finding the standard deviation of arrays that contain all N/A values.
# numpy produces N/A for these points, though, which is exactly what we want.
stdev_ssh = ssh_da.std('time')
stdev_ssh.name = 'standard deviation of SSH'
stdev_ssh.plot();

Plot the SSH time series using hvplot


ssh_da.hvplot.image(y='latitude', x='longitude', cmap='Viridis',).opts(clim=(ssh_da.attrs['valid_min'],ssh_da.attrs['valid_max']))
# import hvplot.xarray
# ssh_da.hvplot.image(y='latitude', x='longitude', cmap='Viridis',).opts(clim=(ssh_da.attrs['valid_min'][0],ssh_da.attrs['valid_max'][0]))
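As one more usage example of analysis in place, a regional mean SSH time series can be computed directly from the dataset loaded above. The latitude/longitude box below is arbitrary and only for illustration:

# Mean sea surface height over an arbitrary equatorial Pacific box, per month (illustrative).
ssh_box = ssh_da.sel(latitude=slice(-10, 10), longitude=slice(-170, -120))
ssh_box_mean = ssh_box.mean(dim=['latitude', 'longitude'])
ssh_box_mean.plot();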

Further Resources

  • Reference Hackathon/workshop tutorials that go into more detail!
  • Earthdata Cloud Cookbook
  • Earthdata Cloud Primer
    • Getting started with Amazon Web Services outside of the workshop, to access and work with data in a cloud environment.